# Introduction
This synthetic dataset simulates real-world lung cancer cases, including demographics, medical history, treatments, and outcomes. It supports predictive modeling, prognosis assessment, and treatment analysis in research.
Research questions: How well do patient age, medical history, and tumor characteristics predict the stage of lung cancer at diagnosis? And how do patients cluster by stage?

Summary of the full dataset (23658 rows).

Character variables (length 23658 each): Patient_ID, Gender, Smoking_History, Tumor_Location, Stage, Treatment, Ethnicity, Insurance_Type, Family_History, Comorbidity_Diabetes, Comorbidity_Hypertension, Comorbidity_Heart_Disease, Comorbidity_Chronic_Lung_Disease, Comorbidity_Kidney_Disease, Comorbidity_Autoimmune_Disease, Comorbidity_Other.

Numeric variables:

| Variable | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|
| Age | 30.00 | 42.00 | 54.00 | 54.44 | 67.00 | 79.00 |
| Tumor_Size_mm | 10.00 | 32.97 | 55.30 | 55.38 | 78.19 | 99.99 |
| Survival_Months | 1.00 | 30.00 | 60.00 | 59.86 | 89.00 | 119.00 |
| Performance_Status | 0 | 1 | 2 | 2 | 3 | 4 |
| Blood_Pressure_Systolic | 90.0 | 112.0 | 134.0 | 134.5 | 157.0 | 179.0 |
| Blood_Pressure_Diastolic | 60.00 | 72.00 | 85.00 | 84.48 | 97.00 | 109.00 |
| Blood_Pressure_Pulse | 60.00 | 70.00 | 80.00 | 79.59 | 90.00 | 99.00 |
| Hemoglobin_Level | 10.00 | 11.99 | 13.98 | 14.00 | 16.00 | 18.00 |
| White_Blood_Cell_Count | 3.501 | 5.109 | 6.730 | 6.736 | 8.354 | 10.000 |
| Platelet_Count | 150.0 | 224.9 | 299.9 | 299.9 | 375.4 | 450.0 |
| Albumin_Level | 3.000 | 3.505 | 4.000 | 3.999 | 4.499 | 5.000 |
| Alkaline_Phosphatase_Level | 30.01 | 52.62 | 75.09 | 75.03 | 97.45 | 119.99 |
| Alanine_Aminotransferase_Level | 5.001 | 13.816 | 22.548 | 22.505 | 31.093 | 40.000 |
| Aspartate_Aminotransferase_Level | 10.00 | 20.07 | 30.27 | 30.13 | 40.11 | 50.00 |
| Creatinine_Level | 0.5000 | 0.7488 | 1.0012 | 0.9995 | 1.2492 | 1.5000 |
| LDH_Level | 100.0 | 137.4 | 174.4 | 174.7 | 212.2 | 250.0 |
| Calcium_Level | 8.000 | 8.641 | 9.259 | 9.261 | 9.883 | 10.500 |
| Phosphorus_Level | 2.500 | 3.120 | 3.731 | 3.743 | 4.364 | 5.000 |
| Glucose_Level | 70.00 | 89.83 | 109.95 | 109.90 | 130.06 | 150.00 |
| Potassium_Level | 3.500 | 3.872 | 4.242 | 4.246 | 4.618 | 5.000 |
| Sodium_Level | 135.0 | 137.5 | 140.0 | 140.0 | 142.5 | 145.0 |
| Smoking_Pack_Years | 0.0168 | 25.0268 | 49.9262 | 49.9136 | 74.9246 | 99.9995 |

Summary of the numeric subset, in which Stage appears as an integer from 1 to 4 (23 variables, matching the variable count reported for the PCA):

| Variable | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|
| Age | 30.00 | 42.00 | 54.00 | 54.44 | 67.00 | 79.00 |
| Tumor_Size_mm | 10.00 | 32.97 | 55.30 | 55.38 | 78.19 | 99.99 |
| Stage | 1.000 | 2.000 | 3.000 | 2.509 | 4.000 | 4.000 |
| Survival_Months | 1.00 | 30.00 | 60.00 | 59.86 | 89.00 | 119.00 |

The remaining 19 numeric variables (Performance_Status through Smoking_Pack_Years) have the same summaries as in the full-dataset summary above.

**Results for the Principal Component Analysis (PCA)**

The analysis was performed on 23658 individuals, described by 23 variables.

The results are available in the following objects:

| | Name | Description |
|---|---|---|
| 1 | `$eig` | eigenvalues |
| 2 | `$var` | results for the variables |
| 3 | `$var$coord` | coord. for the variables |
| 4 | `$var$cor` | correlations variables - dimensions |
| 5 | `$var$cos2` | cos2 for the variables |
| 6 | `$var$contrib` | contributions of the variables |
| 7 | `$ind` | results for the individuals |
| 8 | `$ind$coord` | coord. for the individuals |
| 9 | `$ind$cos2` | cos2 for the individuals |
| 10 | `$ind$contrib` | contributions of the individuals |
| 11 | `$call` | summary statistics |
| 12 | `$call$centre` | mean of the variables |
| 13 | `$call$ecart.type` | standard error of the variables |
| 14 | `$call$row.w` | weights for the individuals |
| 15 | `$call$col.w` | weights for the variables |


| Component | Eigenvalue | Percentage of variance |
|---|---|---|
| comp 1 | 1.059069 | 4.604649 |
| comp 2 | 1.051125 | 4.570111 |
| comp 3 | 1.043533 | 4.537100 |
| comp 4 | 1.034462 | 4.497659 |
| comp 5 | 1.030395 | 4.479977 |
| comp 6 | 1.024297 | 4.453467 |

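
The percentages above can be reproduced from the eigenvalues. The sketch below is in Python rather than the R that produced the output, and it assumes the PCA was run on standardized variables, so that the total variance equals the number of variables (23). Under that assumption each percentage is the eigenvalue divided by 23, times 100. The variable and function names are illustrative only.

```python
# Eigenvalues of the first six components, copied from the table above.
eigenvalues = [1.059069, 1.051125, 1.043533, 1.034462, 1.030395, 1.024297]
N_VARIABLES = 23  # number of variables reported for the PCA


def percent_of_variance(eigenvalue, n_variables=N_VARIABLES):
    """Share of total variance for one component of a standardized PCA."""
    return 100.0 * eigenvalue / n_variables


percentages = [percent_of_variance(ev) for ev in eigenvalues]
cumulative = sum(percentages)

# Share each component would carry if all variables were uncorrelated
# (every eigenvalue equal to 1).
uniform_share = 100.0 / N_VARIABLES

for i, (ev, pct) in enumerate(zip(eigenvalues, percentages), start=1):
    print(f"comp {i}: eigenvalue={ev:.6f}  variance={pct:.4f}%")
print(f"first six components together: {cumulative:.2f}%")
print(f"uniform share per component:   {uniform_share:.4f}%")
```

Together the first six components account for roughly 27% of the variance, and each one sits only slightly above the 100/23 share expected from uncorrelated variables. This suggests, without proving it, that the numeric variables share little linear structure for the PCA to compress.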
# Results

## Key findings

1. Age distribution across stages: Exploratory analysis showed that patient age is widely distributed across all tumor stages, with no apparent clustering and no trend toward older patients at higher stages.
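
One way to check a finding like this is to summarize age within each stage and compute a Spearman rank correlation between age and stage. The following is a minimal Python sketch of that check, not the analysis used in this report. The `records` list is hypothetical toy data, not drawn from the dataset, and all function names are illustrative.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical toy records as (age, stage) pairs; NOT taken from the dataset.
records = [(34, 1), (71, 1), (52, 2), (45, 2), (66, 3), (38, 3), (59, 4), (47, 4)]


def age_summary_by_stage(rows):
    """Return {stage: (min, median, max)} of age for each stage."""
    groups = defaultdict(list)
    for age, stage in rows:
        groups[stage].append(age)
    return {s: (min(a), median(a), max(a)) for s, a in sorted(groups.items())}


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


ages = [age for age, _ in records]
stages = [stage for _, stage in records]
print(age_summary_by_stage(records))
print(f"Spearman correlation (age vs. stage): {spearman(ages, stages):.3f}")
```

A correlation near zero, together with overlapping per-stage age ranges, would be consistent with the finding above. A clearly positive value would instead point to older patients at higher stages.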
# Discussion

## Testable hypothesis

Factors beyond smoking history and age, such as genetic mutations or environmental exposures, are stronger predictors of lung cancer tumor stage at diagnosis.

To test this hypothesis, future research could:

- Incorporate data on genetic markers or environmental pollutants.
- Use multivariate regression or machine learning models to assess the combined impact of smoking, age, genetics, and environment on tumor stage.
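
As a rough illustration of the second suggestion, the sketch below fits a multinomial logistic (softmax) regression by batch gradient descent in plain Python. It is a sketch under stated assumptions, not a recommended implementation: the feature values and labels are hypothetical toy data, the function names are illustrative, and the model treats stage as an unordered category even though stage is ordinal. A real analysis would more likely use an established statistics or machine learning library and an ordinal model.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_scores(W, xi):
    """Linear score per class; the last weight in each row is the bias."""
    row = list(xi) + [1.0]
    return [sum(w * v for w, v in zip(Wc, row)) for Wc in W]


def fit_softmax_regression(X, y, n_classes, lr=0.1, epochs=300):
    """Fit multinomial logistic regression by batch gradient descent."""
    n_features = len(X[0])
    W = [[0.0] * (n_features + 1) for _ in range(n_classes)]
    for _ in range(epochs):
        grad = [[0.0] * (n_features + 1) for _ in range(n_classes)]
        for xi, yi in zip(X, y):
            row = list(xi) + [1.0]
            probs = softmax(class_scores(W, xi))
            for c in range(n_classes):
                err = probs[c] - (1.0 if c == yi else 0.0)
                for j, v in enumerate(row):
                    grad[c][j] += err * v
        # Step against the mean gradient of the cross-entropy loss.
        for c in range(n_classes):
            for j in range(n_features + 1):
                W[c][j] -= lr * grad[c][j] / len(X)
    return W


def predict(W, xi):
    """Return the index of the class with the highest score."""
    scores = class_scores(W, xi)
    return max(range(len(W)), key=lambda c: scores[c])


# Hypothetical standardized predictors (two made-up features per patient)
# and stage labels coded 0, 1, 2. NOT taken from the dataset.
X_toy = [[-2.0, -2.0], [-1.5, -2.5], [2.0, 2.0], [2.5, 1.5], [-2.0, 2.0], [-2.5, 1.5]]
y_toy = [0, 0, 1, 1, 2, 2]

weights = fit_softmax_regression(X_toy, y_toy, n_classes=3)
predictions = [predict(weights, x) for x in X_toy]
print("predicted classes:", predictions)
```

With real data, the size of the fitted weights on standardized predictors gives a first, rough indication of how strongly each predictor is associated with stage. Held-out accuracy would show whether adding genetic or environmental variables improves prediction over smoking and age alone.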